Abstract

The generation of novelty is central to any creative endeavor. Novelty generation and the relationship
between novelty and individual hedonic value have long been subjects of study in social psychology. However,
few studies have utilized large-scale datasets to quantitatively investigate these issues. Here we consider the
domain of American cinema and explore these questions using a database of films spanning a 70 year period.
We use crowdsourced keywords from the Internet Movie Database as a window into the contents of films, and
prescribe novelty scores for each film based on occurrence probabilities of individual keywords and keyword-pairs.
These scores provide revealing insights into the dynamics of novelty in cinema. We investigate
how novelty influences the revenue generated by a film, and find a statistically significant relationship that
resembles the Wundt-Berlyne curve. We also study the statistics of keyword occurrence and the aggregate
distribution of keywords over a 100 year period.

Introduction

Over the last century, cinema has carved out an indelible niche in human culture, and filmmaking has come
to be regarded as an art form in its own right. The film industry of the United States in particular has had a
major influence on the evolution of cinema over the course of its history, and is currently the third largest
producer of films in the world, with a global audience and a gross turnover averaging 29.5 billion US dollars
over the last five years reported [1]. Despite the fact that trends associated with films, the dissection of
their respective successes and failures, and their individual artistic merit are all subjects of avid debate and
discussion in the public realm, and although the economics of film has been extensively researched [2], no
studies, to our knowledge, have quantitatively analyzed the large-scale features of novelty in film plots and
the patterns associated with their evolution. With the advent of culturomics as an emerging science [3],
it is natural to attempt to bridge this gap with the aid of comprehensive sources of film data such as the 
Internet Movie Database (IMDb). 

The Internet Movie Database (www.imdb.com) is a comprehensive online database containing information
on films, television programs and videogames which, according to the site, has "more than 100 million
data items including more than 2 million movies". This in large part is made possible by allowing registered
users of the site to add new database items or edit the information associated with existing ones. One
such category of user-generated information, at the center of this study, is that of plot-keywords, consisting
of single words or word-strings associated with each item. If a keyword proposed by a user is semantically
close to a keyword that already exists (i.e., has already been created for association with one or more
films), then the user is prompted to use the existing keyword, thus suppressing the creation of synonymous
keywords. In the context of films, keywords describe any of a number of aspects of a film including but not
limited to thematic plot-elements (father-son-relationship, power, fame), specific story elements (tied-to-a-chair,
held-at-gunpoint, breaking-and-entering), location references (manhattan-new-york-city, coffee-shop,
chevron-gas-station), specific visual or object references (life-magazine, characters-point-of-view-camera-shot,
coin-flipping-in-the-air) or high-level features of the film (independent-film, female-nudity, cult-film).
Plot-keywords are thus qualitative descriptors spanning several scales of detail and specificity, and they 
potentially constitute a rich information set capable of yielding valuable insights into the evolution of films 
over time. 

The dynamics of tagging - the process of users contributing keywords to associate with specific items - as 
well as folksonomy - the classification of items based on these collective tags - have been widely studied in the 
context of blogs, photo-sharing and social bookmarking [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]. A general consensus
derived from these studies is that despite a lack of central control, shared vocabularies with stable probability 
distributions over words emerge as a result of collaborative tagging. For example, Halpin et al. [4] showed
that the relationship between the frequency of a tag's usage and its rank (based on how frequently it is 
used) is a power-law, and further proposed a model for tagging dynamics based on preferential attachment 
that could yield such a relationship. Almost concurrently, Cattuto et al. [5] showed that the frequency-rank 
plot for tags obtained from Del.icio.us and Connotea indicated a power-law relationship, and demonstrated 
that a Yule-Simon model with long-term memory for tagging dynamics could yield this relationship. In 
the context of information retrieval, Levy and Sandler [9] showed how social tags associated with musical
tracks (on a Last.fm dataset) defined a semantic space that could enable efficient mood-based clustering and
retrieval. Similarly, there have been several studies [10] [11] [12] [13] that have focused on exploring the use
of tags for personalized recommendation and query-based retrieval. As a representative example, Szomszor
et al. [11] investigated the extent to which combining tags obtained from IMDb and ratings data obtained
from Netflix could generate better taste profiles for users, and thus yield a predictor of their ratings for an 
unseen film. 

In contrast to the above studies, the motivation of this work is to utilize the IMDb plot-keywords 
dataset as a window into the evolution of films and their content over the course of the last century, 
and in the process investigate certain aspects of novelty generation in the arts. The characterization of 
novelty, and the processes that lead to it, have been subjects of thorough investigation in psychology and 
social science [14] [15] [16]. Several of these studies emphasize the role of the combinational process - one
that combines existing ideas in a manner not encountered earlier - in novelty generation, in contrast to 
the process of introducing fundamentally new concepts from scratch. Another aspect of sustained research 
interest [17] [18] [19] [20] is the relationship between the novelty of an item and the hedonic value (or pleasure)
derived by an individual upon its consumption. The standard paradigm here, resulting from the pioneering
work of Wundt [21] and Berlyne [22], is captured by the Wundt-Berlyne curve, which posits that increasing
novelty initially results in increasing hedonic value until it reaches a maximum. Further increasing novelty 
beyond this intermediate level results in a rapid decline in hedonic value. In summary, the Wundt curve 
argues that individuals seek a balance between familiarity and novelty, shying away from the banal as 
strongly as from the radically unfamiliar. 

The issues of novelty creation and novelty optimization are undoubtedly relevant to the business of 
cinema. A significant portion of film criticism, commentary and discussion is devoted to analyzing the 
novelty in the writing and execution of film plots. In addition, one of the factors responsible
for successfully securing the financing and distribution of a film is its conformity to current trends and past
conventions. However, little is known in a quantitative sense regarding the degree to which the competing 
objectives of novelty and conformity are balanced in the process of new content creation. The plot-keywords 
dataset has the potential to serve as a starting point in addressing these issues. In addition, it allows us to 
ascribe novelty scores to films on the basis of their content, including not just elements of the underlying 
story, but also elements that encapsulate the tone and style of the final finished product. With this goal 
in mind, we analyze the plot-keywords associated with films produced in the United States over the period 
between and including the years 1890 and 2011, define two novelty scores based on them, and study the 
aggregate patterns in novelty evolution over a 70 year period. In addition, we also provide a number of 
quantitative insights into the probability distribution of plot-keywords over the entire dataset spanning 100 
years, and the statistics of their use over time. 

Results 

We begin by presenting some basic characteristics of the dataset under consideration. Henceforth for brevity, 
we will refer to plot-keywords simply as "keywords". 

Statistics of films and tagging 

Figure 1: (a) The total number of English language films produced in the United States (in blue), and
the number of films remaining after filtering out short films, documentaries and adult films (in red), per 
year, (b) Number of films in the filtered set (red) and number of films in the filtered set with keywords 
(green), per year. The shaded gray region bounds the values which lie within 25% of the total number of 
films released. In the period between and including the years 1929 and 1998 the green curve lies within 
the shaded region showing that greater than 75% of films released each year in this period have keywords 
associated with them. 



Figure 1(a) shows the total number of English-language films originating in the US (see Methods for
details) each year starting from the earliest recorded entry in the year 1890 through 2011. The number
of films produced increases sharply starting around 1907, an increase that corresponds to the "Nickelodeon boom", i.e.,
the sudden increase in the production of films as a result of the success of the Nickelodeon theater in 1905,
which led to the proliferation of theaters devoted to film projection for a mass audience. The majority of
the films produced in this period had runtimes of 10-15 minutes [23], and are classified as "Short" under
the IMDb-Genre field. To obtain the dataset that forms the core of this study, we considered only feature
length films, and additionally only those which were non-adult and non-documentary theatrical releases. As
expected, the peak around 1910 disappears in the filtered set. Analogous to the Nickelodeon boom, there is
a sharp rise in the number of films around the mid-1990s. This is a manifestation of the dramatic increase
in independent-film production that occurred in the 1990s and that, by the end of the decade, led to over
half the feature length films being produced coming from independent studios and producers [24].

Figure 1(b) shows the statistics for the tagging of films released in the period between 1890 and 2011.
Clearly, the association of keywords to films is not consistent over the different release years, with a clear 
paucity in tagging towards the early (the first film associated with a keyword was released in 1910) and late 
years in the period under consideration. However, for years in the period including and between the years 
1929 and 1998, more than 75% of the films released each year have keywords associated with them. For our 
studies on the novelty of films, we therefore focus on the films released within this period. In total, there 
are 21,583 films possessing at least one keyword in this period. 

We refer to the collection of all keywords associated with a film as the film's keyword set. The length 
of keyword sets appears to be exponentially distributed (see Fig. 2(a)), with the median length being 14
keywords. For the restricted set between 1929 and 1998, the median length increases slightly to 19, but the
distribution remains qualitatively similar (not shown). As expected, films in the tail are mostly
popular mainstream films, as shown in Fig. 2(b) for each decade from the 1930s to the 2000s.

Studies on the Google n-gram corpus have demonstrated that trajectories of word-occurrence-frequency
over time can reflect surges of cultural interest in specific events, literary works, persons etc. [3] [25]. We can
expect to glean similar insights from observing the usage of plot-keywords. We begin by defining occurrence
frequency per year for a given keyword as the number of films released that year that are tagged with
the keyword, divided by the total number of films released that year. Figure 3(a) shows trajectories of
occurrence frequency for four example keywords. As observed for words in literature [3] [25], films
too display a temporally local burst in the usage of a plot-element, as can be seen in the example of "world-war-two".
A surge in the occurrence of "class-difference" around 1985 is suggestively coincident with the
conjectured rise in materialistic attitudes during the 1980s [26] [27].
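
The per-year occurrence frequency defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (a list of `(year, keyword_set)` tuples) and the function name `occurrence_frequency` are hypothetical and are not taken from the scripts used in this study.

```python
from collections import Counter, defaultdict


def occurrence_frequency(films):
    """Return {keyword: {year: fraction of that year's films tagged with keyword}}.

    `films` is an iterable of (year, keywords) pairs (assumed input format).
    """
    films_per_year = Counter()
    tagged_per_year = defaultdict(Counter)
    for year, keywords in films:
        films_per_year[year] += 1
        for keyword in set(keywords):  # count each keyword once per film
            tagged_per_year[keyword][year] += 1
    return {
        keyword: {year: n / films_per_year[year] for year, n in by_year.items()}
        for keyword, by_year in tagged_per_year.items()
    }


# Toy example with three hypothetical films.
toy_films = [
    (1942, {"world-war-two", "spy"}),
    (1942, {"romance"}),
    (1943, {"world-war-two"}),
]
toy_frequency = occurrence_frequency(toy_films)
```

Years in which a keyword does not occur are simply absent from its dictionary and can be treated as zero when building a time series.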

Beyond the temporally local trends seen in the association of keywords with films, there could also be 
long-range correlations present. To probe this further, we use the method of detrended fluctuation analysis
(DFA) [28] that is widely employed for investigating the presence of long-range correlations in general time-series,
and has also been specifically used in the context of word usage [25]. We analyzed using DFA (see
Methods) the time series of keyword occurrence frequency for all keywords that appeared in at least 75 of
the years in the period between 1910 - the earliest year with a tagged film - and 2011. In total, there are 461
such keywords. The Hurst exponent α, which signals the presence or absence of long-range correlations, is
obtained for each of these time series using DFA. A value of α = 0.5 indicates no temporal correlations,
α < 0.5 indicates negative correlations while α > 0.5 indicates positive correlations. A distribution of the
Hurst exponents obtained for the 461 time series considered is shown in Figure 3(b), indicating the presence
of long-range positive correlations in the keyword occurrence frequency. These correlations disappear (see
Fig. 3(b) inset) for the set of time series obtained after shuffling the temporal order of data within each
individual time series. 
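
For readers unfamiliar with DFA, the following is a minimal pure-Python sketch of the generic procedure with linear detrending. It is not the implementation used for the results reported here: the window sizes, the linear detrending order and the function names are assumptions made for illustration only.

```python
import math
import random


def _slope(xs, ys):
    """Least-squares slope of ys against xs."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


def _detrended_mean_square(segment):
    """Mean squared residual after removing a least-squares line from segment."""
    times = list(range(len(segment)))
    slope = _slope(times, segment)
    t_mean = sum(times) / len(times)
    y_mean = sum(segment) / len(segment)
    return sum((y - y_mean - slope * (t - t_mean)) ** 2
               for t, y in zip(times, segment)) / len(segment)


def dfa_exponent(series, window_sizes=(4, 8, 16, 32)):
    """Estimate the DFA scaling exponent of a series (window sizes are assumed)."""
    mean = sum(series) / len(series)
    profile, running = [], 0.0
    for value in series:  # integrate the mean-subtracted series
        running += value - mean
        profile.append(running)
    log_n, log_f = [], []
    for n in window_sizes:
        # Non-overlapping windows of length n; any leftover tail is dropped.
        windows = [profile[k:k + n] for k in range(0, len(profile) - n + 1, n)]
        mean_square = sum(_detrended_mean_square(w) for w in windows) / len(windows)
        log_n.append(math.log(n))
        log_f.append(0.5 * math.log(mean_square))  # log of the RMS fluctuation
    return _slope(log_n, log_f)  # slope of log F(n) versus log n


# Synthetic checks: uncorrelated noise versus a strongly correlated random walk.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
walk, position = [], 0.0
for step in noise:
    position += step
    walk.append(position)
alpha_noise = dfa_exponent(noise, window_sizes=(4, 8, 16, 32, 64))
alpha_walk = dfa_exponent(walk, window_sizes=(4, 8, 16, 32, 64))
```

Uncorrelated noise should give an exponent near 0.5, while the random walk gives a much larger value. Shuffling a series before calling `dfa_exponent` mimics the control shown in the inset of Fig. 3(b).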

Evolution of film novelty 

Next, we devise a method to assign a novelty score to each film on the basis of the keywords associated 
with it and the keywords appearing in all films that were released prior to it. The assignment of novelty 
scores is done for films in the continuous period between 1929 and 1998, more than 75% of which per year 
are associated with keywords. Incidentally, the year 1929 also marks the time around which sound in films
became ubiquitous [23], the beginning of the period which came to be known as the golden age of Hollywood
[29], and the year in which the first ever Academy Awards were presented. We formally present the definition
of the novelty score below.

Figure 2: (a) The distribution of keyword set lengths over all films with keywords. The linear decay on the
linear-log plot indicates a roughly exponentially declining probability as the keyword set length increases.
(b) Length of the keyword set for the chronologically ordered set of films with keywords. The gray bars
indicate the lengths of the sets for the different films. For each decade, the film with the longest keyword
set over all releases in that decade is highlighted in red.

Figure 3: (a) The yearly occurrence frequency of specific keywords as a function of time (see text for
details). (b) Distribution (relative frequencies) of the Hurst exponent α for keywords that occur in at least 70
years between the period 1910 and 2011. The mean value of the exponent is 0.8966, indicating the presence
of positive long-range correlations. The inset shows the distribution after shuffling each of the time series.
The correlations largely disappear upon shuffling as indicated by the mean value of 0.5590 obtained for α.



For a film i, denote by M* the set of all films that appear prior to the release of film i. We use m to
index an arbitrary film, and K_m to be the set of keywords associated with m. We begin by computing the
probability P(w) of observing a keyword w over the set of films M* ∪ {i} for all keywords appearing in the
set:

$$P(w) = \frac{1}{|M^{*}| + 1} \sum_{m \in M^{*} \cup \{i\}} \mathbf{1}_{K_m}(w) \qquad (1)$$

where 1_A denotes the indicator function for set A:

$$\mathbf{1}_{A}(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases} \qquad (2)$$


Then, for any keyword w, the quantity -log P(w) is a standard measure of the "surprise" in observing
keyword w [30]. With this in mind, we quantify the novelty of film i as the average surprise over all keywords
associated with the film. Although, ideally, P(w) should designate the prior probability distribution, i.e.,
the probability distribution for keywords computed over films in M*, we include film i in its computation in
order to circumvent the ill-defined logarithm arising when P(w) = 0, i.e., when w appears for the first time
in K_i. Thus, the first measure of novelty we define scores the film on the basis of how rarely, on average,
the elements associated with it have appeared in films in the past. Formally we write the elemental novelty
for film i as:

$$\mathcal{N}_e = -\frac{1}{|K_i|} \sum_{w \in K_i} \log P(w) \qquad (3)$$

It is worth noting that each term in the sum in Eq. (3) is identical to the Term-Frequency-Inverse-Document-Frequency
(TFIDF) score for the associated keyword, which is commonly used in query-based information
retrieval [31].
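
As an illustration of the elemental novelty score, the sketch below processes films in release order and averages the surprise -log P(w) over each film's keywords, with the film itself included in the count as described above. The function name and the list-of-sets input format are hypothetical, and the natural logarithm is assumed.

```python
import math
from collections import Counter


def elemental_novelty(keyword_sets):
    """Elemental novelty for each film in a chronologically ordered list.

    keyword_sets[i] is the keyword set of film i (assumed input format).
    Films without keywords receive None.
    """
    counts = Counter()  # number of films seen so far that carry each keyword
    scores = []
    for n_prior, keywords in enumerate(keyword_sets):
        keywords = set(keywords)
        counts.update(keywords)   # film i is included in its own counts
        total = n_prior + 1       # |M*| + 1
        if not keywords:
            scores.append(None)
            continue
        surprise = [-math.log(counts[w] / total) for w in keywords]
        scores.append(sum(surprise) / len(surprise))
    return scores


# Toy example: the third film uses a keyword never seen before.
toy_scores = elemental_novelty([{"a", "b"}, {"a", "c"}, {"d"}])
```

In the toy example the third film consists of a single previously unseen keyword, so it attains the maximum value log(|M*| + 1) = log 3.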

While Eq. (3) scores films based on the rarity or abundance of their individual plot-elements, it is agnostic
to how rare or abundant the combinations of their plot-elements are. To capture the novelty associated with
the combinations of keywords, we can define similarly to Eq. (3) the novelty resulting from the occurrence of
specific keyword-pairs, triples and so on. Here we restrict our study of higher-order terms to keyword-pairs
and formally write the combinatorial novelty for film i as:

$$\mathcal{N}_c = -\frac{1}{\binom{|K_i|}{2}} \sum_{\{u, v\} \subseteq K_i,\; u \neq v} \log P(u, v) \qquad (4)$$

where P(u, v) is the probability of keywords u and v occurring together in a film in the set M* ∪ {i} (defined
similarly as for individual keywords in Eq. (1)). Both N_e and N_c have the same maximum attainable value of
log(|M*| + 1) (the lower bound is 0), but capture two distinct aspects of novelty generation. Thus observing
trends in their evolution over time not only gives us insights pertinent to specific events in the history of
cinema, but also helps elucidate the degree to which elemental and combinatorial novelty contribute to the
creation of new content.
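
A corresponding sketch for the combinatorial novelty is given below. It assumes that the pair score is the average of -log P(u, v) over all unordered keyword pairs of the film, which is consistent with the stated bounds of 0 and log(|M*| + 1); the function name and input format are again hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def combinatorial_novelty(keyword_sets):
    """Pair-based novelty for each film in a chronologically ordered list.

    Films with fewer than two keywords have no pairs and receive None.
    """
    pair_counts = Counter()  # number of films seen so far containing each pair
    scores = []
    for n_prior, keywords in enumerate(keyword_sets):
        # Sorting gives every unordered pair a single canonical tuple.
        pairs = list(combinations(sorted(set(keywords)), 2))
        pair_counts.update(pairs)  # film i is included in its own counts
        total = n_prior + 1        # |M*| + 1
        if not pairs:
            scores.append(None)
            continue
        surprise = [-math.log(pair_counts[p] / total) for p in pairs]
        scores.append(sum(surprise) / len(surprise))
    return scores


# Toy example: the second film reuses one pair and introduces two new ones.
toy_pair_scores = combinatorial_novelty([{"a", "b"}, {"a", "b", "c"}, {"z"}])
```

Because the number of pairs grows quadratically with the keyword set length, a film with many keywords is likely to contain at least some rare pairs even when every individual keyword is common.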

Figure 4(a) shows the chronological evolution of elemental novelty over the period 1929-1998. To eliminate
situations where a film with a small keyword set registers a very high (very low) novelty due to the
rarity (abundance) of its few keywords, we only consider films with keyword sets of length greater than 10 
(see SI Section 1.1). Films are chronologically ordered by the time of release, and the abscissa is simply the 
index i of the films, with the vertical dashed lines corresponding to the indices demarcating the beginning 
of a new decade. The maximum attainable value log(|M*| + 1) is indicated by the red dashed line. What 
is clearly observed is that the median value of elemental novelty is well separated from the upper bound 
of novelty over the entire period. Some features in the evolution also bear pointing out. For example, a 
marked upward trend can be seen around the mid-1960s in both the yearly median, as well as the envelope 
of the time series, which agrees well with the documented birth of the American New Wave which brought 
with it a marked shift in themes, style and modes of production [23]. Interestingly, the period between 1929 
and 1945, commonly referred to as the golden age of Hollywood, is not marked by an increase in or a stable 
value of median novelty, but rather by a subtle decline. This decline is likely a consequence of the practice 
of block booking prevalent in that period, which by virtually guaranteeing exhibition for any film as long 
as it came from a major studio, did little to disincentivize the production of films with low novelty [2] [23].

Figure 4(b) analogously shows the evolution of combinatorial novelty over the period, whose upper
envelope, in contrast to elemental novelty, consistently stays close to the maximum attainable value. Gross
features similar to those seen for elemental novelty can also be seen here. For example, N_c rises in the 1960s
and its variance decreases, while in contrast, the variance shows an increasing trend during the "golden age" 
between 1929 and 1945. 

Figure 4(c) and (d) respectively show the distributions of elemental novelty and combinatorial novelty
for films in each of the 7 decades in the period considered. While no clear trend is observed in the case of N_e,
the distribution of N_c appears to sample progressively higher values with each passing decade. Figure 4(e)
and (f) respectively show the distribution of elemental novelty and combinatorial novelty for the aggregated
set of films (with keywords) between 1929 and 1998. 

We also investigate the evolution of elemental and combinatorial novelty for films within specific genres, 
and these reveal trends unique to each of them. For example, Fig. 5(a) and (b) show the evolution of novelties
for films containing "Action" as one of their IMDb genre classes, while Fig. 5(c) and (d) show the case for
films under the "Sci-Fi" genre. The median and the envelope curves of both N_e and N_c for the case of
action films show a sudden disruptive jump to higher values in the decade 1960-70. This is compatible
with the thesis, based on studies by film historians, that elements comprising the modern action film genre
originated with the James Bond franchise in the 1960s [32]. Similar plots for selected other genres are shown
in Supplementary Figure 2. 

Figure 4: The evolution of (a) elemental novelty and (b) combinatorial novelty for films between 1929 and
1998. The solid red curve shows the median yearly novelty, and the gray envelope curves show the novelty 
of the 5th and 95th percentile of films each year. The dashed red curve shows the maximum possible 
value of novelty that could have been achieved for each film. Distributions of (c) elemental novelty and 
(d) combinatorial novelty by decade. Distribution of (e) elemental and (f) combinatorial novelty for the 
aggregated set of films released between 1929 and 1998. 



Relationship between film novelty and revenue 

Next, motivated by the Wundt-Berlyne curve, we investigate whether there is any relationship between 
the novelty of a film and the hedonic value derived from its consumption at an aggregate population level. 

Figure 5: (a) Elemental novelty and (b) combinatorial novelty for films containing 'Action' within their
'genre' field on IMDb. (c) Elemental novelty and (d) combinatorial novelty for films containing 'Sci-Fi' 
within their 'genre' field on IMDb. The solid red curve shows the median yearly novelty, and the gray 
envelope curves show the novelty of the 5th and 95th percentile of films each year. 



Following arguments outlined in Supplementary Text, Section 1.2, we utilize the (inflation-adjusted) revenue 
generated by the film as a measure of its mass appeal. 

Figure 6(a) shows the mean revenue of films as a function of their elemental novelty. The mean revenue
appears to increase steadily and then drops precipitously beyond N_e = 6. This behavior bears a resemblance
to the Wundt-Berlyne curve [21] [22] (see inset) between novelty and hedonic value. To ensure that this
relationship is statistically significant, we generate 50000 randomized versions of the data where the revenues
are shuffled. The gray curves bound the region between the 10th and 90th percentile of the mean revenue
values obtained for the randomized data, for each (binned) value of novelty. For predominant portions of
the range of novelty, the true revenue values lie outside of the region bounded by these curves, indicating
a relationship that differs significantly from that expected by chance. Moreover, in light of the symmetric
probability density function of novelty (Fig. 6(b)), the asymmetry in the shape of the mean revenue curve
further suggests that novelty and revenues are not independent of each other. To conclusively establish the
statistical significance of the association between N_e and revenue, we compare their mutual information (MI)
(see Methods) in the true data to that present in the randomized versions. Figure 6(c) compares the distribution
(relative frequencies) of MI obtained for the 50000 randomized versions to the value of 0.0558 obtained for
the true data. None of the randomized versions exhibited an MI greater than or equal to 0.0558, indicating
a p-value less than 2 × 10^-5. The standard deviation of mean revenues also shows a Wundt-Berlyne-like
behavior (the standard deviation and raw scatter of revenue and novelty are shown in Supplementary Figures 3 and
4).

Figure 6: (a) Mean inflation-adjusted revenue versus elemental novelty N_e is shown by the black curve. The
gray dots and squares show the 10th and 90th percentile of mean revenues obtained for 50000 randomized
versions of the data. (b) The probability density function of N_e for the data used in (a). (c) The relative
frequencies of mutual information values between N_e and mean revenue obtained for the randomized
datasets, compared to the mutual information for the true dataset. (d) Mean inflation-adjusted revenue
versus combinatorial novelty N_c (black curve) and the 10th and 90th percentile (gray dots, gray squares
respectively) of mean revenues obtained for 50000 randomized versions of the data. (e) The probability
density function of N_c for the data used in (d). (f) The relative frequencies of mutual information values
between N_c and mean revenue obtained for the randomized datasets, compared to the mutual information
for the true dataset.

Figure 6(d), (e) and (f) show analogous results for combinatorial novelty N_c. Once again, a broad
similarity to the Wundt-Berlyne curve is present, and none of the 50000 randomized versions of the data
showed an MI value greater than or equal to the value (0.0760) obtained for the true data.
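
The shuffling test described above can be sketched as follows. This is a generic permutation test on a plug-in mutual information estimate, not the exact estimator described in Methods: the sketch assumes that novelty and revenue have already been discretized into bin labels, and the function names and the default number of shuffles are illustrative choices.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def permutation_test(xs, ys, n_shuffles=1000, seed=0):
    """Return (observed MI, fraction of shuffles with MI >= observed)."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # destroys any association between xs and ys
        if mutual_information(xs, shuffled) >= observed:
            hits += 1
    return observed, hits / n_shuffles


# Toy example: revenue bin fully determined by novelty bin (hypothetical labels).
novelty_bins = [0, 0, 1, 1] * 25
revenue_bins = list(novelty_bins)
toy_mi, toy_fraction = permutation_test(novelty_bins, revenue_bins)
```

When no shuffle reaches the observed value, the returned fraction is zero and the p-value can only be bounded from above by 1/n_shuffles, which is how a bound of 1/50000 = 2 × 10^-5 arises for 50000 randomizations.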

Figure 7: (a) The cumulative probability, P(occurrence frequency > f), for keywords. The red line shows
a fit corresponding to the cumulative probability for a stretched exponential distribution, exp(-λ f^β), with
parameters λ = 1.0119 and β = 0.2716. (b) The frequency of keyword-pair occurrence as a function of the
rank of the pair (blue). For the keyword-pair corresponding to each rank, the probability of occurrence
under the independence assumption is shown by a red dot. For the 10 highest ranked keyword-pairs, the
probabilities of occurrence under the independence assumption are indicated by red crosses.



Overall occurrence probabilities of keywords and keyword-pairs 

Next, we study the probability distribution of plot-keywords over the entire set of films in the period 
between 1890 and 2011. Unlike the distributions associated with other corpora [3] [25], the distribution does
not appear to follow Zipf's law, as seen from the curvature present in the log-log plot of the cumulative
probability distribution of usage frequency (Fig. 7(a)). Indeed, a stretched exponential fit obtained through
maximum-likelihood fitting [34] [35] agrees well with the data (parameters provided in caption of Fig. 7(a)).

Any non-trivial process of plot generation would result in some keyword-pairs occurring more often 
than expected by chance, and others less often. To probe whether this is indeed borne out by the data, we 
compare the occurrence frequency of keyword pairs to the frequency obtained under the assumption that the 
constituent keywords are chosen independently of each other, in proportion to their respective occurrence 
probabilities. The results in Fig. 7(b) show a substantial difference between the true keyword-pair
frequencies and those obtained under the independence assumption. 
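
The comparison against the independence assumption amounts to contrasting the observed co-occurrence frequency of a pair with the product of the marginal frequencies of its two keywords. A minimal sketch, with a hypothetical function name and input format:

```python
from collections import Counter
from itertools import combinations


def pair_vs_independence(keyword_sets):
    """Return {pair: (observed frequency, frequency expected under independence)}."""
    n_films = len(keyword_sets)
    single_counts = Counter()
    pair_counts = Counter()
    for keywords in keyword_sets:
        keywords = sorted(set(keywords))  # canonical order for pair tuples
        single_counts.update(keywords)
        pair_counts.update(combinations(keywords, 2))
    return {
        (u, v): (count / n_films,
                 (single_counts[u] / n_films) * (single_counts[v] / n_films))
        for (u, v), count in pair_counts.items()
    }


# Toy example with four hypothetical films.
toy_pairs = pair_vs_independence([{"a", "b"}, {"a", "b"}, {"a"}, {"c"}])
```

In the toy example the pair (a, b) occurs in half of the films, more often than the 0.375 expected if the two keywords were assigned independently.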

Finally, we present a visual depiction (Fig. 8) of the rise and fall of keywords that are associated with
movies over the entire period from 1910 to 2011. Unlike a traditional time series plot (as in Fig. 3(a)),
streamgraphs, introduced in [37] [38], provide a lucid graphical approach to simultaneously observing the
growth and decline in the usage of different keywords (thickness of each "stream"), along with their relative
usage in a given year (relative thickness of a stream in a cross section).

A prominently visible feature in Fig. 8(a) is the growth in the use of the keyword independent-film beyond
1955, presumably resulting from the demise of the studio system and marking the period when studios began
forming partnerships with independent producers. Furthermore, until that time, the monopoly of the studios
on the exhibition venues strongly suppressed the visibility of independently produced films [23]. A notable
feature in the action streamgraph (Fig. 8(b)) is the early dominance of the keyword b-movie and its decline
in the 1950s. Indeed, between 1930 and 1950, action films mostly consisted of low-budget westerns created
to fit the double feature programming format [39]. However, by the 1950s, with film audience numbers in
decline as a result of the predominance of television, and with the end of the studio system, the low-budget
action film gradually declined in production and the genre as a whole underwent a redefinition in the 1960s
[32].

Figure 8: Streamgraphs for most probable keywords occurring in (a) all films, (b) action films and (c)
science-fiction films. See Methods for details.



Discussion 

We have demonstrated that user-generated keywords coarsely characterizing a film can provide a quantitative
window into the evolution of novelty in films over a 70 year period. Specifically, the novelty scores
defined here reveal both subtle trends in overall novelty evolution (Fig. 4) and disruptive changes in the
evolution of specific genres (Fig. 5(a)). A notable feature of several evolution curves is that of either a
saturation in novelty, or the achievement of a maximum in novelty during the 1960s (Fig. 4(a),(b), and
Fig. 5(a),(b)). Presumably, this corresponds to the widely held thesis [23] that the break-up of the studio
system, the advent of competition from television, and the rise of several socio-political movements all
contributed in varying measures to the 1960s becoming a defining decade in the history of American cinema.

More generally, our studies show that novelty in cinema, at least for the case of films produced in the
United States, emerges in the form of an increase in combinatorial novelty, by repurposing existing elements
in novel ways. This agrees well with the notion of "combinational creativity" [E], an accepted paradigm in
the field of creativity and novelty generation. Secondly, we show that the relationship between novelty and
aggregate reward, in the context of films, follows a Wundt-Berlyne-like curve with maximum reward being
achieved at intermediate novelty values. In addition, we find that the variance of reward also shows
a Wundt-Berlyne-like behavior.

While this study has focussed on utilizing keywords to observe aggregate trends, there are several possible 
extensions that can be pursued in future work. The first is to attempt a refinement of novelty scores 
which takes into account the descriptive level of the keyword, an issue that is ignored in this study. For
example, here we treat a keyword characterizing a high-level feature related to the production (for example
independent-film) equivalently to a keyword which specifies a story-element (for example murder). A possible
approach to alleviating this is by employing a probabilistic topic model like hierarchical latent Dirichlet 
allocation on the keyword set 03] , and then defining a more finely resolved measure of novelty based on the 
obtained hierarchy of topics. 

A second potential research direction is to analyze the utility of the novelty score discussed here, or
refinements of it, for search and recommendation. Yet another application of such scores is in the area of
artificial or computer-aided story generation [36], where ranking the novelty of plot-element combinations based
on their prior probabilities could allow exploration in novel directions. Understanding aggregate novelty
preferences may also provide insights into the viral spread and mass adoption (or lack thereof) of certain 
products and services, and is a research direction with valuable applications to marketing campaigns and 
social network based behavior-change initiatives. Furthermore, any venue offering the combined availability 
of crowdsourced data, the network between users providing tags, and their individual tagging behavior, 
provides the opportunity to segment the population on the basis of their novelty preferences, and design 
products and services tailored specifically to each segment. 

Methods 

Data collection and analysis: 

Data was obtained from IMDb (http://www.imdb.com/interfaces) as plain text data files in May 2012. Data 
was processed with Python scripts using the IMDbPY package (http://imdbpy.sourceforge.net/). First, all 
data items corresponding to films (not including straight-to-video releases, or TV movies) were extracted. 
Next, those items which had "Country" listed as 'USA' and "Language" listed as 'English' were extracted. 
Finally, all films with 'Adult', 'Short' or 'Documentary' under "Genre" were removed to leave us with the 
set under consideration. For more details, see Supplementary Text, Section 1.1. 
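
The three filtering steps above can be illustrated with a minimal Python sketch. It does not use the IMDbPY API; it assumes each film has already been parsed into a plain dictionary. The field names (`kind`, `countries`, `languages`, `genres`) and the labels in `EXCLUDED_KINDS` are hypothetical and chosen only for illustration.

```python
# Minimal sketch of the filtering pipeline, assuming films were already
# parsed into dictionaries. All field names and the "kind" labels below
# are hypothetical; they are not taken from the IMDb data files.
EXCLUDED_KINDS = {"video movie", "tv movie"}
EXCLUDED_GENRES = {"Adult", "Short", "Documentary"}


def select_films(records):
    """Return the records that survive all three filtering steps."""
    selected = []
    for rec in records:
        # Step 1: keep theatrical films only.
        if rec["kind"] in EXCLUDED_KINDS:
            continue
        # Step 2: require 'USA' as a country and 'English' as a language.
        if "USA" not in rec["countries"] or "English" not in rec["languages"]:
            continue
        # Step 3: drop films tagged with any excluded genre.
        if EXCLUDED_GENRES & set(rec["genres"]):
            continue
        selected.append(rec)
    return selected


films = [
    {"title": "A", "kind": "movie", "countries": ["USA"],
     "languages": ["English"], "genres": ["Drama"]},
    {"title": "B", "kind": "movie", "countries": ["USA"],
     "languages": ["English"], "genres": ["Short", "Comedy"]},
    {"title": "C", "kind": "tv movie", "countries": ["USA"],
     "languages": ["English"], "genres": ["Drama"]},
    {"title": "D", "kind": "movie", "countries": ["France"],
     "languages": ["French"], "genres": ["Drama"]},
]
kept_titles = [rec["title"] for rec in select_films(films)]
```

In this toy example only film "A" passes all three filters.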

Detrended fluctuation analysis: 

Detrended fluctuation analysis for a time series $y = \{y_1, y_2, \ldots, y_N\}$ involves the following steps:

(i) Mean-center the original time series: $\tilde{y} = \{y_1 - \langle y \rangle, y_2 - \langle y \rangle, \ldots, y_N - \langle y \rangle\}$, where $\langle y \rangle = \frac{1}{N} \sum_{i=1}^{N} y_i$.

(ii) Generate a random walk $z$ by summing up displacements corresponding to values in $\tilde{y}$: $z_j = \sum_{i=1}^{j} \tilde{y}_i$.

(iii) Partition the total number of steps in the walk (i.e., total number of elements in the original time
series) into boxes of size $L$.

(iv) Within each box, compute the local trend $\bar{z}$ using a linear fit to the data. Compute the variance
in the detrended fluctuations within each box and then compute the square root of its average over all
boxes: $\sigma(L) \sim \sqrt{\langle (z(t) - \bar{z})^2 \rangle}$, where $\langle \cdots \rangle$ corresponds to an average over boxes and the term within
corresponds to the variance within a box.

(v) Repeat the process for different values of $L$ and estimate the exponent $\alpha$ in the scaling $\sigma(L) \sim L^{\alpha}$.
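
Steps (i) to (v) can be sketched in Python with NumPy as follows. This is a minimal illustration rather than the code used for the paper: it assumes non-overlapping boxes, discards any trailing samples that do not fill a complete box, and estimates the exponent by a least-squares fit in log-log space. The function names and the choice of box sizes are ours.

```python
import numpy as np


def dfa_fluctuation(y, box_size):
    """Root-mean-square detrended fluctuation for one box size L."""
    y = np.asarray(y, dtype=float)
    z = np.cumsum(y - y.mean())        # steps (i) and (ii): mean-center, integrate
    n_boxes = len(z) // box_size       # step (iii): non-overlapping boxes
    t = np.arange(box_size)
    box_variances = []
    for b in range(n_boxes):
        segment = z[b * box_size:(b + 1) * box_size]
        slope, intercept = np.polyfit(t, segment, 1)   # step (iv): local linear trend
        residual = segment - (slope * t + intercept)
        box_variances.append(np.mean(residual ** 2))
    return np.sqrt(np.mean(box_variances))


def dfa_exponent(y, box_sizes):
    """Step (v): slope of log(sigma(L)) against log(L)."""
    sigmas = [dfa_fluctuation(y, L) for L in box_sizes]
    alpha, _ = np.polyfit(np.log(box_sizes), np.log(sigmas), 1)
    return alpha


# Uncorrelated noise is expected to give an exponent close to 0.5.
rng = np.random.default_rng(0)
white_noise = rng.standard_normal(4096)
alpha = dfa_exponent(white_noise, [8, 16, 32, 64, 128, 256])
```

An exponent near 0.5 indicates an uncorrelated series, while values above 0.5 indicate persistent long-range correlations.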

Mutual Information estimation: 

The mutual information between random variables $x$ and $y$ with marginal distributions $P(x)$ and $P(y)$
respectively and joint distribution $P(x, y)$ is defined as:

$$I(x; y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}$$

In the absence of knowledge of a specific form for the relationship between variables, mutual information is
a useful signifier of the presence or absence of dependencies between variables x and y [30]. The estimation of
mutual information between two continuous variables with a finite number of observations is a well-studied 
problem. We utilize a method proposed in |4H B2] and an implementation of the same provided by Zbynek 
Koldovsky. 
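
To make the definition concrete, the sketch below estimates mutual information with a simple fixed-width histogram (plug-in) estimator. This is deliberately not the estimator used in this study; it is a cruder stand-in chosen for brevity, and it is known to be biased for small samples. The function name and the number of bins are arbitrary choices of ours.

```python
import numpy as np


def mutual_information_hist(x, y, bins=16):
    """Plug-in mutual information estimate (in nats) from a 2D histogram."""
    joint_counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint_counts / joint_counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    product = p_x @ p_y                      # P(x)P(y) on the same grid
    nonzero = p_xy > 0                       # empty cells contribute nothing
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / product[nonzero])))


rng = np.random.default_rng(1)
x = rng.standard_normal(5000)
y_dependent = x ** 2 + 0.1 * rng.standard_normal(5000)   # nonlinear dependence
y_independent = rng.standard_normal(5000)

mi_dependent = mutual_information_hist(x, y_dependent)
mi_independent = mutual_information_hist(x, y_independent)
```

The dependent pair has a linear correlation close to zero yet yields a clearly larger mutual information than the independent pair, which is why mutual information is useful when the form of the relationship is unknown.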

Novelty and hedonic value 

Budgets and revenues generated from theatrical exhibition are present for 1680 films in the period under
consideration. We adjust for inflation all dollar amounts that have a reporting year associated with them,
based on the cumulative price index table for the year 2011. To strike a balance between having a sufficiently
large number of films to analyze and minimizing the disparities in the exhibition capabilities of the films
considered, we restrict our analysis to films with an inflation-adjusted budget of at least 1 million dollars (see
SI, Section 1.2 for further details). Finally, to account for the fact that novelty as perceived by a general
audience largely involves comparison to films released over a short period in the past (rather than over
the entire duration that cinema has been around), we compute $\mathcal{N}_e$ and $\mathcal{N}_c$ for a film $i$ only considering
films which were released in the 6 months preceding the month of its release.

Streamgraphs 

A "stream" for a keyword was generated using the number of occurrences of the keyword for each year in the 
period. The resulting signal was smoothed using spline interpolation. A stacked graph was generated and,
to guarantee symmetry about the Y axis, the baseline was displaced in proportion to the total width of the
stack as described in [37, 38]. For Fig. 8(a) we use the set of keywords obtained from the union of the most
frequently used keyword for each year in the period. This set contains 9 unique keywords. For the streamgraphs
shown in Figs. 8(b) and (c), for films belonging to the action and science-fiction genres respectively, keyword
sets were chosen using a similar procedure as for Fig. 8(a) but were additionally pruned to retain only the
10 keywords with the highest average usage-frequency over the period.
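
The baseline displacement can be sketched as follows. This is a minimal illustration that assumes the baseline is set to minus one half of the total stack height in each year, which makes the stack symmetric about the zero line. The spline smoothing step is omitted, and the function name and toy counts are ours.

```python
import numpy as np


def symmetric_stack(series):
    """Lower and upper boundaries of each stream in a symmetric stacked graph.

    `series` has shape (n_keywords, n_years) and holds non-negative counts.
    """
    series = np.asarray(series, dtype=float)
    total = series.sum(axis=0)
    baseline = -0.5 * total                       # shift by half the stack width
    tops = baseline + np.cumsum(series, axis=0)   # upper edge of each stream
    bottoms = np.vstack([baseline, tops[:-1]])    # lower edge of each stream
    return bottoms, tops


# Toy yearly counts for two keywords over three years.
counts = np.array([[1, 2, 3],
                   [3, 2, 1]])
bottoms, tops = symmetric_stack(counts)
```

The resulting `bottoms` and `tops` arrays could then be passed, one stream at a time, to a plotting routine that fills the area between two curves.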
1 Supplementary Text

1.1 Details of the dataset used 

Data from the Internet Movie Database was obtained as plain text files from the Alternative Interfaces
page (http://www.imdb.com/interfaces). Data was downloaded in May 2012. Only data pertaining to films
through 2011 was used. Data was processed using the Python IMDbPY module (http://imdbpy.sourceforge.net/).
We first extracted all films that contained 'USA' in the field 'Country'. This corresponded to all films
produced in (or at least attributed to) the United States. Next, from the above set we extracted films that did
not contain the terms 'Short', 'Adult' or 'Documentary' in their 'Genre' field. The number of films in this
filtered set is 46596. The total number of films with keywords in this set is 33128, and the earliest film with
a keyword has a release year of 1910.

For Figs. 1, 2, 3, 7 and 8 in the main text, wherever keywords are considered, data from all films released
between 1910 and 2011 was used.

For Figs. 4, 5 and 6 in the main text, novelty was calculated for films released between 1929 and 1998,
the period for which there is a fairly consistent degree of tagging (see Fig. 1(b) of main text). For Figs.
4 and 5, when computing the novelty of a film $i$, the probability of keyword usage $P(w)$ was calculated
using all films with keywords that were released prior to it (the earliest being in 1910), as well as the film
under consideration. For Fig. 6, when computing the novelty of a film, the probability of keyword usage
was calculated using all films that appeared in the 6 months prior to the release month of the film, and the film
under consideration.
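
The two reference sets described above (all prior films, or only films from a recent window) can be illustrated with the sketch below. It assumes, purely for illustration, that the usage probability of a keyword is the fraction of films in the reference set tagged with it; the exact definition is given in the main text and may differ. The field names `release` and `keywords` and the function name are hypothetical.

```python
from collections import Counter
from datetime import date


def keyword_probabilities(films, target, window_start=None):
    """Usage probability of each keyword of `target`.

    The reference set contains films released before `target` (optionally
    restricted to those released on or after `window_start`) plus `target`.
    """
    pool = [
        f for f in films
        if f is not target
        and f["release"] < target["release"]
        and (window_start is None or f["release"] >= window_start)
    ]
    pool.append(target)  # the film under consideration is always included
    counts = Counter(w for f in pool for w in set(f["keywords"]))
    # Assumed definition: fraction of reference films tagged with the keyword.
    return {w: counts[w] / len(pool) for w in target["keywords"]}


films = [
    {"release": date(1950, 1, 1), "keywords": ["murder", "detective"]},
    {"release": date(1960, 6, 1), "keywords": ["murder"]},
    {"release": date(1960, 9, 1), "keywords": ["murder", "robot"]},
]
target = films[2]

p_all_prior = keyword_probabilities(films, target)
p_recent = keyword_probabilities(films, target, window_start=date(1960, 3, 1))
```

In this toy example the keyword "robot" has probability 1/3 against all prior films and 1/2 against the recent window, since the window contains fewer reference films.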

Furthermore, for Fig. 6, only films with an inflation-adjusted budget greater than or equal to $1 million
were considered (see next section for explanation). 

In Figs. 4, 5 and 6, the films were additionally filtered to retain only those with more than 10 keywords
in their respective keyword sets. After this filtering, the total number of films used for the results in Fig. 4 was
13322, which constitutes approximately 62% of all films with keywords in the period 1929 to 1998. The 
total number of films used for Fig. 6 after filtering for keyword set lengths and budget was 1509. 

1.2 Choosing a proxy for hedonic value 

Obtaining a proxy for the hedonic value (or pleasure) derived from a film is difficult using the available data.
Here we choose to quantify it in an aggregate sense, using statistics of the film's popularity. The IMDb
rating for the film is a natural candidate, but it is the result of an undisclosed weighting
scheme employed to prevent ballot stuffing [1]. How effectively the ratings alleviate the problem is unclear, 
but even in the best-case scenario when voting is honest, the rating would reflect the taste preferences of 
only the registered users of IMDb. Perhaps as a consequence of the idiosyncrasies in the computation of
IMDb ratings, the relationship between novelty and rating shows little discernible structure (see Supplementary Fig. 1).

Instead of IMDb-rating, we use the revenue generated by the film through ticket sales as a measure of 
hedonic value. All revenues are adjusted for inflation using the consumer price index table for 2011. One of 
the drawbacks of utilizing this measure is its inability to capture distaste, i.e., negative hedonic value.
It might appear that using a measure like return on investment (ROI), which is the profit (loss) divided by
the production cost, might alleviate this shortcoming. However, we argue that any individual or aggregate
measure of hedonic value should be agnostic to the production cost of the film. In other words, given two 
films with equal viewership (i.e., equal ticket sale revenue), it is unreasonable to ascribe a higher reward to 
the film with the lower budget, simply on account of its lower production cost. 

Another possible argument in favor of incorporating budgets in the measure of reward is to counter the 
influence of budgets in drawing audiences; expensive films invariably have more theaters exhibiting them, 
thus suggesting that the production budget has a direct influence on viewership. However, an expensive 
film that performs poorly is liable to be removed from exhibition after the initial commitment period 
(the minimum contractually obligated period for which a theater screens the film) has elapsed, and a
relatively inexpensive film playing in only a few theaters could outperform it in terms of viewership if it 
garners sustained audience interest. Thus the relationship between production cost and viewership is not 
straightforward. However, in order to mitigate issues arising from such differences in exhibition
capability, we only use films that have (inflation-adjusted) budgets above $1 million to obtain the results
shown in Fig. 6 of the main text.

2 Supplementary Figures 
Supplementary Figure 1: Scatterplot of (a) $\mathcal{N}_e$ and (b) $\mathcal{N}_c$ versus IMDb rating for all films (with keywords)
between 1929 and 1998.



[Supplementary Figure 2 panels, one per genre (recovered panel titles: Western, Comedy, Fantasy); x-axis: Chronologically ordered films]



Supplementary Figure 2: Trends for films based on appearance of different terms in their IMDb 'Genre' 
field. 
Supplementary Figure 3: Standard deviation of revenue as a function of (a) elemental novelty and (b)
combinatorial novelty. (c) Probability density function of revenue for films considered in the results of Fig. 6.
Supplementary Figure 4: Scatterplot of revenue as a function of (a) elemental novelty and (b) combinatorial
novelty from the data used for Fig. 6 in main text. The black curve on both plots indicates the 90th
percentile of revenue for each binned value of novelty. Novelty values were binned over the interval shown
into 100 bins.
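
A percentile curve of this kind can be computed along the following lines. This is a minimal sketch that assumes equal-width bins spanning the observed range of novelty values; the caption does not state how the bin edges were chosen. The function name and the toy data are ours.

```python
import numpy as np


def binned_percentile(novelty, revenue, n_bins=100, q=90):
    """q-th percentile of revenue within equal-width novelty bins."""
    novelty = np.asarray(novelty, dtype=float)
    revenue = np.asarray(revenue, dtype=float)
    edges = np.linspace(novelty.min(), novelty.max(), n_bins + 1)
    # Digitizing against interior edges maps every value to a bin 0..n_bins-1.
    bin_index = np.digitize(novelty, edges[1:-1])
    centers = 0.5 * (edges[:-1] + edges[1:])
    percentiles = np.full(n_bins, np.nan)     # NaN marks empty bins
    for b in range(n_bins):
        in_bin = revenue[bin_index == b]
        if in_bin.size:
            percentiles[b] = np.percentile(in_bin, q)
    return centers, percentiles


# Toy data with two bins, so the result is easy to check by hand.
toy_novelty = np.array([0.0, 0.1, 0.2, 0.9, 1.0])
toy_revenue = np.array([1.0, 2.0, 3.0, 10.0, 20.0])
centers, p90 = binned_percentile(toy_novelty, toy_revenue, n_bins=2)
```

Bins that contain no films are left as NaN, so a plotting routine would show gaps there instead of interpolated values.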